68 research outputs found

    Reaching a modular, domain-agnostic and containerized development in biomedical Natural Language Processing systems.

    The last century saw an exponential increase in scientific publications in the biomedical domain. Despite the potential value of this knowledge, most of this data is only available as unstructured textual literature, which has limited its systematic access, use and exploitation. This limitation can be avoided, or at least mitigated, by relying on text mining techniques to automatically extract relevant data from textual documents and structure it. A significant challenge for scientific software applications, including Natural Language Processing (NLP) systems, consists in providing facilities to share, distribute and run such systems in a simple and convenient way. Software containers can host their own dependencies and auxiliary programs, isolating them from the execution environment. In addition, a workflow manager can be used for the automated orchestration and execution of the text mining pipelines. Our work is focused on the study and design of new techniques and approaches to construct, develop, validate and deploy NLP components and workflows with sufficient genericity, scalability and interoperability to allow their use and instantiation across different domains. The results and techniques acquired will be applied in two main use cases: the detection of relevant information from preclinical toxicological reports, under the eTRANSAFE project [1]; and the indexing of biomaterials publications with relevant concepts as part of the DEBBIE project.

    FAIRsoft - A practical implementation of FAIR principles for research software

    Computational tools are increasingly becoming constitutive parts of scientific research, from experimentation and data collection to the dissemination and storage of results. Unfortunately, however, research software is not subjected to the same requirements as other methods of scientific research: being peer-reviewed, being reproducible and allowing one to build upon another’s work. This situation is detrimental to the integrity and advancement of scientific research, leading to computational methods frequently being impossible to reproduce and/or verify [1]. Moreover, they are often opaque, directly unavailable or impossible to use by others [2]. One step to address this problem could be formulating a set of principles that research software should meet to ensure its quality and sustainability, resembling the FAIR (Findable, Accessible, Interoperable and Reusable) Data Principles [3]. The FAIR Data Principles were created to solve similar issues affecting scholarly data, namely great difficulty of sharing and accessibility, and are currently widely recognized across fields. We present here FAIRsoft, our initial effort to assess the quality of research software using a FAIR-like framework, as a first step towards its implementation in OpenEBench [4], the ELIXIR benchmarking platform.

    Perspectives on automated composition of workflows in the life sciences [version 1; peer review: 2 approved]

    Scientific data analyses often combine several computational tools in automated pipelines, or workflows. Thousands of such workflows have been used in the life sciences, though their composition has remained a cumbersome manual process due to a lack of standards for annotation, assembly, and implementation. Recent technological advances have returned the long-standing vision of automated workflow composition into focus. This article summarizes a recent Lorentz Center workshop dedicated to automated composition of workflows in the life sciences. We survey previous initiatives to automate the composition process, and discuss the current state of the art and future perspectives. We start by drawing the “big picture” of the scientific workflow development life cycle, before surveying and discussing current methods, technologies and practices for semantic domain modelling, automation in workflow development, and workflow assessment. Finally, we derive a roadmap of individual and community-based actions to work toward the vision of automated workflow development in the forthcoming years. A central outcome of the workshop is a general description of the workflow life cycle in six stages: 1) scientific question or hypothesis, 2) conceptual workflow, 3) abstract workflow, 4) concrete workflow, 5) production workflow, and 6) scientific results. The transitions between stages are facilitated by diverse tools and methods, usually incorporating domain knowledge in some form. Formal semantic domain modelling is hard and often a bottleneck for the application of semantic technologies. However, life science communities have made considerable progress here in recent years and are continuously improving, renewing interest in the application of semantic technologies for workflow exploration, composition and instantiation. 
Combined with systematic benchmarking with reference data and large-scale deployment of production-stage workflows, such technologies enable a more systematic process of workflow development than we know today. We believe that this can lead to more robust, reusable, and sustainable workflows in the future.
    Stian Soiland-Reyes was supported by BioExcel-2 Centre of Excellence, funded by European Commission Horizon 2020 programme under European Commission contract H2020-INFRAEDI-02-2018 823830. Carole Goble was supported by EOSC-Life, funded by European Commission Horizon 2020 programme under grant agreement H2020-INFRAEOSC-2018-2 824087. We gratefully acknowledge the financial support from the Lorentz Center, ELIXIR, and the Leiden University Medical Center (LUMC) that made the workshop possible. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
    Peer Reviewed. Article signed by 33 authors: Anna-Lena Lamprecht, Magnus Palmblad, Jon Ison, Veit Schwämmle, Mohammad Sadnan Al Manir, Ilkay Altintas, Christopher J. O. Baker, Ammar Ben Hadj Amor, Salvador Capella-Gutierrez, Paulos Charonyktakis, Michael R. Crusoe, Yolanda Gil, Carole Goble, Timothy J. Griffin, Paul Groth, Hans Ienasescu, Pratik Jagtap, Matúš Kalaš, Vedran Kasalica, Alireza Khanteymoori, Tobias Kuhn, Hailiang Mei, Hervé Ménager, Steffen Möller, Robin A. Richardson, Vincent Robert, Stian Soiland-Reyes, Robert Stevens, Szoke Szaniszlo, Suzan Verberne, Aswin Verhoeven, Katherine Wolstencroft.
    Postprint (published version)

    Detection of early seeding of Richter transformation in chronic lymphocytic leukemia

    Richter transformation (RT) is a paradigmatic evolution of chronic lymphocytic leukemia (CLL) into a very aggressive large B cell lymphoma conferring a dismal prognosis. The mechanisms driving RT remain largely unknown. We characterized the whole genome, epigenome and transcriptome, combined with single-cell DNA/RNA-sequencing analyses and functional experiments, of 19 cases of CLL developing RT. Studying 54 longitudinal samples covering up to 19 years of disease course, we uncovered minute subclones carrying genomic, immunogenetic and transcriptomic features of RT cells already at CLL diagnosis, which were dormant for up to 19 years before transformation. We also identified new driver alterations, discovered a new mutational signature (SBS-RT), recognized an oxidative phosphorylation (OXPHOS)-high, B cell receptor (BCR)-low signaling transcriptional axis in RT and showed that OXPHOS inhibition reduces the proliferation of RT cells. These findings demonstrate the early seeding of subclones driving advanced stages of cancer evolution and uncover potential therapeutic targets for RT.
    The authors thank the Hematopathology Collection registered at the Biobank of Hospital Clínic, Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS) and the Biobank HUB-ICO-IDIBELL (PT20/00171) for sample procurement, S. Martín, F. Arenas, the Genomics Core Facility of the IDIBAPS, CNAG Sequencing Unit, Mission Bio, Omniscope and Barcelona Supercomputing Center for the technical support and the computer resources at MareNostrum4 (RES activity, BCV-2018-3-0001). This study was supported by the la Caixa Foundation (CLLEvolution-LCF/PR/HR17/52150017, Health Research 2017 Program HR17-00221, to E.C.), the European Research Council under the European Union’s Horizon 2020 Research and Innovation Program (810287, BCLLatlas, to E.C., J.I.M.-S., H.H. and I.G.), the Instituto de Salud Carlos III and the European Regional Development Fund Una Manera de Hacer Europa (PMP15/00007 to E.C. 
and RTI2018-094584-B-I00 to D.C.), the American Association for Cancer Research (2021 AACR-Amgen Fellowship in Clinical/Translational Cancer Research, 21-40-11-NADE to F.N.), the European Hematology Association (EHA Junior Research Grant 2021, RG-202012-00245 to F.N.), the Lady Tata Memorial Trust (International Award for Research in Leukaemia 2021-2022, LADY_TATA_21_3223 to F.N.), the Generalitat de Catalunya Suport Grups de Recerca AGAUR (2017-SGR-1142 to E.C., 2017-SGR-736 to J.I.M.-S. and 2017-SGR-1009 to D.C.), the Accelerator award CRUK/AIRC/AECC joint funder partnership (AECC_AA17_SUBERO to J.I.M.-S.), the Fundació La Marató de TV3 (201924-30 to J.I.M.-S.), the Centro de Investigación Biomédica en Red Cáncer (CIBERONC; CB16/12/00225, CB16/12/00334, CB16/12/00236), the Ministerio de Ciencia e Innovación (PID2020-117185RB-I00 to X.S.P.), the Fundación Asociación Española Contra el Cáncer (FUNCAR-PRYGN211258SUÁR to X.S.P.), the Associazione Italiana per la Ricerca sul Cancro Foundation (AIRC 5 × 1,000 no. 21198 to G.G.) and the CERCA Programme/Generalitat de Catalunya. H.P.-A. is a recipient of a predoctoral fellowship from the Spanish Ministry of Science, Innovation and Universities (FPU19/03110). A.D.-N. is supported by the Department of Education of the Basque Government (PRE_2017_1_0100). E.C. is an Academia Researcher of the Institució Catalana de Recerca i Estudis Avançats of the Generalitat de Catalunya. This work was partially developed at the Center Esther Koplowitz (Barcelona, Spain).
    Peer Reviewed. Article signed by 52 authors: Ferran Nadeu, Romina Royo, Ramon Massoni-Badosa, Heribert Playa-Albinyana, Beatriz Garcia-Torre, Martí Duran-Ferrer, Kevin J. Dawson, Marta Kulis, Ander Diaz-Navarro, Neus Villamor, Juan L. Melero, Vicente Chapaprieta, Ana Dueso-Barroso, Julio Delgado, Riccardo Moia, Sara Ruiz-Gil, Domenica Marchese, Ariadna Giró, Núria Verdaguer-Dot, Mónica Romo, Guillem Clot, Maria Rozman, Gerard Frigola, Alfredo Rivas-Delgado, Tycho Baumann, Miguel Alcoceba, Marcos González, Fina Climent, Pau Abrisqueta, Josep Castellví, Francesc Bosch, Marta Aymerich, Anna Enjuanes, Sílvia Ruiz-Gaspà, Armando López-Guillermo, Pedro Jares, Sílvia Beà, Salvador Capella-Gutierrez, Josep Ll. Gelpí, Núria López-Bigas, David Torrents, Peter J. Campbell, Ivo Gut, Davide Rossi, Gianluca Gaidano, Xose S. Puente, Pablo M. Garcia-Roves, Dolors Colomer, Holger Heyn, Francesco Maura, José I. Martín-Subero & Elías Campo.
    Postprint (published version)

    COVID-19 Flow-Maps an open geographic information system on COVID-19 and human mobility for Spain

    COVID-19 is an infectious disease caused by the SARS-CoV-2 virus, which has spread all over the world leading to a global pandemic. The fast progression of COVID-19 has been mainly related to the high contagion rate of the virus and the worldwide mobility of humans. In the absence of pharmacological therapies, governments from different countries have introduced several non-pharmaceutical interventions to reduce human mobility and social contact. Several studies based on Anonymized Mobile Phone Data have been published analysing the relationship between human mobility and the spread of coronavirus. However, to our knowledge, none of these data-sets integrates cross-referenced geo-localised data on human mobility and COVID-19 cases into one all-inclusive open resource. Herein we present COVID-19 Flow-Maps, a cross-referenced Geographic Information System that integrates regularly updated time-series accounting for population mobility and daily reports of COVID-19 cases in Spain at different scales of spatial and temporal resolution. This integrated and up-to-date data-set can be used to analyse the human dynamics to guide and support the design of more effective non-pharmaceutical interventions.
    This work was supported by the Generalitat de Catalunya through the project PDAD14/20/00001, and by the H2020 programme under Grant Agreement 825070 (INFORE) and the INB Grant (PT17/0009/0001 - ISCIII-SGEFI/ERDF).
    Peer Reviewed. Postprint (published version)

    Automatic, efficient and scalable provenance registration for FAIR HPC workflows

    Provenance registration is becoming more and more important, as we increase the size and number of experiments performed using computers. In particular, when provenance is recorded in HPC environments, it must be efficient and scalable. In this paper, we propose a provenance registration method for scientific workflows, efficient enough to run on supercomputers (thus, it could run in other environments with more relaxed restrictions, such as distributed ones). It also must be scalable in order to deal with large workflows, which are more typically used in HPC. We also target transparency for the user, shielding them from having to specify how provenance must be recorded. We implement our design using the COMPSs programming model as a Workflow Management System (WfMS) and use RO-Crate as a well-established specification to record and publish provenance. Experiments are provided, demonstrating the run time efficiency and scalability of our solution.
    This work has been supported by the Spanish Government (PID2019-107255GB-C21), by Generalitat de Catalunya (contract 2017-SGR-01414) and the EU’s Horizon research and innovation programme under Grant agreement No 101058129 (DT-GEO). Also, it has been contributed in the CECH project, co-funded with 50% by the European Regional Development Fund under the framework of the ERFD Operative Programme for Catalunya 2014-2020, with a grant of 1.527.637,88 €. LRN, JMF and SCG are partly supported by INB Grant (PT17/0009/0001 - ISCIII-SGEFI / ERDF), and their work received funding from the EU’s Horizon 2020 research and innovation programme under grant agreements EOSC-Life No 824087, and EJP RD No 825575.
    Peer Reviewed. Postprint (author's final draft)

    Evolution of selenophosphate synthetases: emergence and relocation of function through independent duplications and recurrent subfunctionalization

    Selenoproteins are proteins that incorporate selenocysteine (Sec), a nonstandard amino acid encoded by UGA, normally a stop codon. Sec synthesis requires the enzyme Selenophosphate synthetase (SPS or SelD), conserved in all prokaryotic and eukaryotic genomes encoding selenoproteins. Here, we study the evolutionary history of SPS genes, providing a map of selenoprotein function spanning the whole tree of life. SPS is itself a selenoprotein in many species, although functionally equivalent homologs that replace the Sec site with cysteine (Cys) are common. Many metazoans, however, possess SPS genes with substitutions other than Sec or Cys (collectively referred to as SPS1). Using complementation assays in fly mutants, we show that these genes share a common function, which appears to be distinct from the synthesis of selenophosphate carried out by the Sec- and Cys- SPS genes (termed SPS2), and unrelated to Sec synthesis. We show here that SPS1 genes originated through a number of independent gene duplications from an ancestral metazoan selenoprotein SPS2 gene that most likely already carried the SPS1 function. Thus, in SPS genes, parallel duplications and subsequent convergent subfunctionalization have resulted in the segregation to different loci of functions initially carried by a single gene. This evolutionary history constitutes a remarkable example of emergence and evolution of gene function, which we have been able to trace thanks to the singular features of SPS genes, wherein the amino acid at a single site unequivocally determines protein function and is intertwined with the evolutionary fate of the entire selenoproteome.

    The landscape of expression and alternative splicing variation across human traits

    Understanding the consequences of individual transcriptome variation is fundamental to deciphering human biology and disease. We implement a statistical framework to quantify the contributions of 21 individual traits as drivers of gene expression and alternative splicing variation across 46 human tissues and 781 individuals from the Genotype-Tissue Expression project. We demonstrate that ancestry, sex, age, and BMI make additive and tissue-specific contributions to expression variability, whereas interactions are rare. Variation in splicing is dominated by ancestry and is under genetic control in most tissues, with ribosomal proteins showing a strong enrichment of tissue-shared splicing events. Our analyses reveal a systemic contribution of types 1 and 2 diabetes to tissue transcriptome variation with the strongest signal in the nerve, where histopathology image analysis identifies novel genes related to diabetic neuropathy. Our multi-tissue and multi-trait approach provides an extensive characterization of the main drivers of human transcriptome variation in health and disease.
    This study was funded by the HumTranscriptom project with reference PID2019-107937GA-I00. R.G.-P. was supported by a Juan de la Cierva fellowship (FJC2020-044119-I) funded by MCIN/AEI/10.13039/501100011033 and “European Union NextGenerationEU/PRTR.” J.M.R. was supported by a predoctoral fellowship from “la Caixa” Foundation (ID 100010434) with code LCF/BQ/DR22/11950022. A.R.-C. was supported by a Formación Personal Investigador (FPI) fellowship (PRE2019-090193) funded by MCIN/AEI. R.C.-G. was supported by an FPI fellowship (PRE2020-092510) funded by MCIN/AEI. M.M. was supported by a Ramon y Cajal fellowship (RYC-2017-22249).
    Peer Reviewed. Postprint (published version)